Categorical Predictors

Statistics for Data Science II

Introduction

  • Recall the general linear model, y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k + \varepsilon

  • Until now, we have discussed continuous predictors.

  • Now we will introduce the use of categorical, or qualitative, predictors.

  • This means that we will include predictors that categorize the observations.

    • We can assign numbers to the categories, however, the numbers are nominal.

Categorical Variables

  • If there are c classes in a categorical predictor, we include c-1 in the model.

    • e.g., undergraduate status: freshman, sophomore, junior, senior

    • e.g., AKC-recognized pug color: fawn, black

  • We will create indicator variables to include in our model.

  • The c-1 predictors included in the model will be binary indicators for category. x_i = \begin{cases} 1 & \textnormal{if category $i$} \\ 0 & \textnormal{if another category} \end{cases}

  • In the undergraduate status example, we could create four indicator variables, but we would only include three in our model.

  • While we will call them indicator variables, you will more often see them called dummy variables.

Data about Bee Colonies

library(tidyverse)
colony <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/colony.csv')
head(colony, n=5)
colony %>% count(months)

Creating Indicator Variables

  • We could use mutate() and if_else() to define indicator variables.

  • However, if we do not need to condense categories, the dummy_cols() function from the fastDummies package works well.

library(fastDummies)
dataset <- dataset %>% # overwrite dataset
  dummy_cols(select_columns = c("var1", "var2", ..., "varQ")) # add indicators for the variables listed
  • Note that this will also create a column that indicates if the value is missing or not.

    • This variable must not be included in the model.

Creating Indicator Variables

  • Let’s create indicator variables for the months variable.
library(fastDummies)
colony <- colony %>% # overwrite data
  dummy_cols(select_columns = c("months")) # create indicators for months
colony %>% 
  select(starts_with("months")) %>% # limit output to columns related to months
  head(n=3) # show first 3 obs
library(fastDummies)
colony <- colony %>% # overwrite data
  mutate(quarter = case_when(months == "January-March" ~ "Q1", # convert to quarters
                             months == "April-June" ~ "Q2",
                             months == "July-September" ~ "Q3",
                             months == "October-December" ~ "Q4")) %>%
  dummy_cols(select_columns = c("quarter")) # create indicators for quarter
colony %>% 
  select(starts_with("quarter")) %>% # limit output to columns related to months
  head(n=3) # show first 3 obs

Modeling with Categorical Predictors

  • We represent a categorical variable with c classes with c-1 indicator variables in a model.

  • The last indicator variable not included is called the reference group.

  • How do we choose a reference group?

    • It depends on the story being told / what is of interest.

    • It does not affect the usefulness of the model, only the interpretations.

Modeling with Categorical Predictors

  • Suppose we are interested in modeling the number of colonies lost as a function of the year, quarter, and total number of colonies.

  • What happens when we do not use indicator variables?


Call:
glm(formula = colony_lost ~ quarter + year + colony_max, data = colony)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.279e+05  2.204e+05   1.942   0.0524 .  
quarterQ2   -4.058e+03  5.762e+02  -7.042 3.26e-12 ***
quarterQ3    3.302e+01  5.784e+02   0.057   0.9545    
quarterQ4   -5.622e+02  5.785e+02  -0.972   0.3314    
year        -2.119e+02  1.092e+02  -1.941   0.0525 .  
colony_max   1.167e-01  1.086e-03 107.466  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 49277184)

    Null deviance: 6.2772e+11  on 1149  degrees of freedom
Residual deviance: 5.6373e+10  on 1144  degrees of freedom
  (72 observations deleted due to missingness)
AIC: 23641

Number of Fisher Scoring iterations: 2
m1 <- glm(colony_lost ~ quarter + year + colony_max, data = colony)
summary(m1)

Modeling with Categorical Predictors

  • We can see that R automatically treated them as categorical predictors, selecting a reference group for us.
coefficients(m1)
  (Intercept)     quarterQ2     quarterQ3     quarterQ4          year 
 4.279260e+05 -4.057731e+03  3.301536e+01 -5.621519e+02 -2.119223e+02 
   colony_max 
 1.166717e-01 
  • R uses the “first group” as the reference group.

    • In this case, it sees “Q1” as the “first” group because “1” is first numerically.
  • It is fine to do this for “ease” - if I am doing preliminary modeling, I do not always use indicator variables.

  • Warning: If you have a column stored as numeric, this method does not work.

    • R will treat it as a continuous varaible, which is not always correct.

    • User beware! Always be aware of how categorical variables are stored!

Modeling with Categorical Predictors

  • Circling back to indicator variables, let’s construct our model using those.

    • First, we need to see what their names are.
colony %>% 
  select(starts_with("quarter")) %>% # limit output to columns related to months
  head(n=5) # show first 3 obs

Modeling with Categorical Predictors

  • Using those variables in the model, we get the same results when Q1 is the reference.

Call:
glm(formula = colony_lost ~ quarter_Q2 + quarter_Q3 + quarter_Q4 + 
    year + colony_max, data = colony)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.279e+05  2.204e+05   1.942   0.0524 .  
quarter_Q2  -4.058e+03  5.762e+02  -7.042 3.26e-12 ***
quarter_Q3   3.302e+01  5.784e+02   0.057   0.9545    
quarter_Q4  -5.622e+02  5.785e+02  -0.972   0.3314    
year        -2.119e+02  1.092e+02  -1.941   0.0525 .  
colony_max   1.167e-01  1.086e-03 107.466  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 49277184)

    Null deviance: 6.2772e+11  on 1149  degrees of freedom
Residual deviance: 5.6373e+10  on 1144  degrees of freedom
  (72 observations deleted due to missingness)
AIC: 23641

Number of Fisher Scoring iterations: 2
m2 <- glm(colony_lost ~ quarter_Q2 + quarter_Q3 + quarter_Q4 + year + colony_max, data = colony)
summary(m2)

\hat{y} = 427926.00 - 4057.73 x_{\text{Q2}} + 33.02 x_{\text{Q3}} - 562.15 x_{\text{Q4}} - 211.92 x_{\text{year}} + 0.12 x_{\text{tot. col.}}

Modeling with Categorical Predictors

  • Maybe we are interested in using Q4 (the “holiday season”) as the reference.

Call:
glm(formula = colony_lost ~ quarter_Q1 + quarter_Q2 + quarter_Q3 + 
    year + colony_max, data = colony)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.274e+05  2.203e+05   1.940   0.0526 .  
quarter_Q1   5.622e+02  5.785e+02   0.972   0.3314    
quarter_Q2  -3.496e+03  5.987e+02  -5.839 6.84e-09 ***
quarter_Q3   5.952e+02  5.977e+02   0.996   0.3196    
year        -2.119e+02  1.092e+02  -1.941   0.0525 .  
colony_max   1.167e-01  1.086e-03 107.466  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 49277184)

    Null deviance: 6.2772e+11  on 1149  degrees of freedom
Residual deviance: 5.6373e+10  on 1144  degrees of freedom
  (72 observations deleted due to missingness)
AIC: 23641

Number of Fisher Scoring iterations: 2
m2 <- glm(colony_lost ~ quarter_Q1 + quarter_Q2 + quarter_Q3 + year + colony_max, data = colony)
summary(m2)

\hat{y} = 427363.8 + 562.16 x_{\text{Q1}} - 3495.58 x_{\text{Q2}} + 595.17 x_{\text{Q3}} - 211.92 x_{\text{year}} + 0.12 x_{\text{tot. col.}}

Testing Categorical Predictors

  • We are interested in knowing if the overall categorical variable is a significant predictor of the outcome.

    • If c = 2, we can use the results from summary().

    • If c \ge 3, we must request the Type III ANOVA.

      • Note: there are three types of sums of squares going into ANOVA. We are usually only interested in two:

        • Type I: variables are tested in the order that they enter the model.
        • Type III: variables are tested after adjusting for all other variables in the model.
  • We will use Anova() from the car package as follows:

car::Anova(model_results, type = 3) 
  • Note: we do not call the car package directly as it has conflicts with the tidyverse package.

Testing Categorical Predictors

In theory:

  • Hypotheses

    • H_0: \beta_{1}^* = ... = \beta_{s}^* = 0 in the model y = \beta_0 + \beta_1 x_1 + ... + \beta_q x_q + \beta_{1}^*x_1^* = ... = \beta_{s}^*x_s^* + \varepsilon

    • H_1: at least one \beta_i^* \ne 0 in the model y = \beta_0 + \beta_1 x_1 + ... + \beta_q x_q + \beta_{1}^*x_1^* = ... = \beta_{s}^*x_s^* + \varepsilon

  • Test Statistic and p-Value

    • F_0 = \frac{\left[ \text{SSReg(full)} - \text{SSReg(reduced)} \right]/s}{\text{MSE(full)}}

    • p = P(F_{s, n-q-s} \ge F_0)

  • Rejection Region

    • Reject H_0 if p < \alpha

Testing Categorical Predictors

  • Annotations from another lifetime:

Testing Categorical Predictors

In practice:

  • Hypotheses

    • H_0: \beta_{1} = \beta_{2} = ... = \beta_{c-1} = 0 in the specified model

    • H_1: at least one \beta_i \ne 0 in the specified model

  • Test Statistic and p-Value

    • \chi^2_0 and p from car::Anova(model_results, type = 3)^*
  • Rejection Region

    • Reject H_0 if p < \alpha

^* - note that the Anova() function produces the likelihood ratio test, which gives a \chi^2 statistic.

Testing Categorical Predictors

  • Let’s look at this in our example,
car::Anova(m1, type = 3)
  • Both quarter (p<0.001) and total number of colonies (p<0.001). are significant predictors of the number of lost colonies.

  • The year (p=0.052) is not a significant predictor of the number of lost colonies.

Testing Categorical Predictors

  • Hypotheses

    • H_0: \beta_{\text{Q2}} = \beta_{\text{Q3}} = \beta_{\text{Q4}} = 0 in the specified model

    • H_1: at least one \beta_i \ne 0 in the specified model

  • Test Statistic and p-Value

    • \chi^2_0 = 65.6

    • p < 0.001

  • Rejection Region

    • Reject H_0 if p < \alpha; \alpha = 0.05
  • Conclusion/Interpretation

    • Reject H_0.

    • There is sufficient evidence to suggest that there is a relationship between number of bee colonies lost and the quarter of the year.

Testing Categorical Predictors

  • Note: reference group does not affect overall significance of the term.
m2_a <- glm(colony_lost ~ quarter + year + colony_max, data = colony)
car::Anova(m2_a, type = 3)
m2_b <- glm(colony_lost ~ quarter + year + colony_max, data = colony)
car::Anova(m2_b, type = 3)

Interpreting Coefficients

  • Interpretations of categorical predictors are slightly different than continuous predictors.

    • For continuous variables, we interpret in terms of a 1-unit increase in x_i.

    • What is a “1-unit” increase of an indicator variable?

      • Going from x_i=0 to x_i=1
  • \beta_i for a categorical variable is the average difference between groups x_i=0 and x_i=1.

    • Suppose \hat{\beta}=-10 when modeling systolic blood pressure as a function of high school graduation (yes=1, no=0).

    • Those that are high school graduates (x=1, group of interest) have a systolic blood pressure 10 mmHg lower than those that did not graduate high school (x=0, reference group).

Interpreting Coefficients

  • In our example,

\hat{y} = 427926.00 - 4057.73 x_{\text{Q2}} + 33.02 x_{\text{Q3}} - 562.15 x_{\text{Q4}} - 211.92 x_{\text{year}} + 0.12 x_{\text{tot. col.}}

  • There are 4058 fewer colonies lost in Q2 as compared to Q1.

  • There are 33 more colonies lost in Q3 as compared to Q1.

  • There are 562 fewer colonies lost in Q4 as compared to Q1.

  • For every year increase in time, 212 fewer colonies are lost.

  • For every 1 colony increase in the total number of colonies, 1/10 of a colony is expected to be lost.

\hat{y} = 427363.8 + 562.16 x_{\text{Q1}} - 3495.58 x_{\text{Q2}} + 595.17 x_{\text{Q3}} - 211.92 x_{\text{year}} + 0.12 x_{\text{tot. col.}}

  • There are 562 more colonies lost in Q1 as compared to Q4.

  • There are 3496 fewer colonies lost in Q2 as compared to Q4.

  • There are 595 more colonies lost in Q3 as compared to Q4.

  • For every year increase in time, 212 fewer colonies are lost.

  • For every 1 colony increase in the total number of colonies, 1/10 of a colony is expected to be lost.

Testing Categorical Predictors

  • Now that we understand how to interpret, let’s revisit testing.

  • We’ve already discussed testing the overall significance of the term - this is called global significance.

  • Now, let’s focus on the individual terms.

    • This information comes from summary().
  • The individual tests tell us if there is a significant difference between the group of interest and the reference group.

    • e.g., if modeling systolic blood pressure as a function of high school education, we would be testing the difference in systolic blood pressure between those that did and did not graduate high school.
  • We are not always interested in the individual tests…

    • Sometimes we are interested in multiple tests…
  • Note: This is essentially post-hoc testing. We should adjust our \alpha!

    • I like to take the Bonferroni approach: \alpha_\text{B} = \alpha/k.

Visualizing the Model

  • Visualization of the model is more of an art than a science… but I still have guidelines.

  • The generic visualization for a model is to have a scatterplot of the data with a regression line (or multiple lines) overlaid.

  • Note that the outcome (y) will always be specified on the y-axis.

  • We will decide on a continuous predictor (one x_i) to vary on the x-axis.

  • Then we will create predicted values by holding all other variables in the model constant.

Visualizing the Model

  • Recall our example,

\hat{y} = 427926.00 - 4057.73 x_{\text{Q2}} + 33.02 x_{\text{Q3}} - 562.15 x_{\text{Q4}} - 211.92 x_{\text{year}} + 0.12 x_{\text{tot. col.}}

  • We will let year vary - this will show us what happens over time.

  • We will create a line for each quarter.

    • Q1: x_{\text{Q2}}=0, x_{\text{Q3}}=0, x_{\text{Q4}}=0
    • Q2: x_{\text{Q2}}=1, x_{\text{Q3}}=0, x_{\text{Q4}}=0
    • Q3: x_{\text{Q2}}=0, x_{\text{Q3}}=1, x_{\text{Q4}}=0
    • Q4: x_{\text{Q2}}=0, x_{\text{Q3}}=0, x_{\text{Q4}}=1
  • We will plug in the median number of total colonies.

median(colony$colony_max, na.rm = TRUE)
[1] 21000

Visualizing the Model

\hat{y} = 427926.00 - 4057.73 x_{\text{Q2}} + 33.02 x_{\text{Q3}} - 562.15 x_{\text{Q4}} - 211.92 x_{\text{year}} + 0.12 x_{\text{tot. col.}}

  • Putting this into mutate(),
# base:  427926.00 - 4057.73*x_Q2 + 33.02*x_Q3 - 562.15*x_Q4 - 211.92*year + 0.12*21000

colony <- colony %>%
  mutate(p_q1 = 427926.00 - 4057.73*0 + 33.02*0 - 562.15*0 - 211.92*year + 0.12*21000,
         p_q2 = 427926.00 - 4057.73*1 + 33.02*0 - 562.15*0 - 211.92*year + 0.12*21000,
         p_q3 = 427926.00 - 4057.73*0 + 33.02*1 - 562.15*0 - 211.92*year + 0.12*21000,
         p_q4 = 427926.00 - 4057.73*0 + 33.02*0 - 562.15*1 - 211.92*year + 0.12*21000)

Visualizing the Model

  • Note that I never construct the “final” graph on the first try - it always takes multiple edits/tweaks
colony %>% 
  ggplot(aes(x = year, y = colony_lost)) +
  geom_point() +
  geom_line(aes(y = p_q1), color = "red") +
  geom_line(aes(y = p_q4), color = "orange") +
  theme_bw()

colony %>% 
  filter(colony_lost < 5000) %>%
  ggplot(aes(x = year, y = colony_lost)) +
  geom_point() +
  geom_line(aes(y = p_q1), color = "red") +
  geom_line(aes(y = p_q4), color = "orange") +
  theme_bw()

colony %>% 
  filter(colony_lost < 5000) %>%
  ggplot(aes(x = year, y = colony_lost)) +
  geom_point(color = "gray") +
  geom_line(aes(x = year, y = p_q1), color = "black", linetype = "dashed", size = 1) +
  annotate("text", x = 2021.2, y = 2175, label = "Q1", size=5) +
  geom_line(aes(x = year, y = p_q4), color = "black", linetype = "dotted", size = 1) +
  annotate("text", x = 2021.2, y = 1600, label = "Q4", size=5) +
  scale_x_continuous(breaks = 2015:2021,
    labels = paste0(c("2015", "2016", "2017", "2018", "2019", "2020", "2021"), "")) +
  labs(y = "Number of Colonies Lost",
       x = "Year") +
  theme_bw()

Wrap Up

  • In this lecture, we have introduced the concept of categorical predictors.

    • Always remember that if there are c categories, there will be c-1 predictor terms in the model.

    • When c = 2, we can use the test results from summary() to discuss significance.

    • When c \ge 3, we must use a partial F test (i.e., car::Anova())

    • When including for visualization, must plug in 0’s and 1’s - no means, medians, etc.!

      • We plug in only plausible values.